Section 2 Data
2.1 Data Sources
Data used in this project comes from two sources, namely:
The Australian Electoral Commission (AEC) (Commission 2023b). The national body overseeing and running federal elections, the AEC holds detailed election result records. All results for federal elections held in the 21st century are available online through its Tally Room website (Commission 2023a).
The Australian Bureau of Statistics (ABS) (Statistics 2023b). The ABS provides a wide range of national statistics and is responsible for conducting a national census of population and housing every 5 years. Comprehensive census data is provided in multiple formats, including CSV files through Census Data Packs (Statistics 2023a), available for censuses from 2006 onwards.
Both organisations are the authoritative sources for electoral and statistical data in Australia, and both provide their data openly. Although there are no quality issues, the way the data is provided presents other challenges:
- In both cases, data are provided in large volumes and exhaustive granularity. Data extraction and aggregation can be time-consuming and resource-intensive if not done effectively.
- Census data points are provided using the ABS's own geographical standard, and only a small selection of census data is provided already aggregated for each Commonwealth Electoral Division. Conversion between ABS geographical structures and electoral divisions is not straightforward, as there is no 1:1 correspondence. Both geographical reference systems are modified at each election and each census.
- Despite the best efforts of both organisations to maintain consistency, names of electorates, parties, and census attributes change over time, which requires keeping track of all those changes and mapping them accordingly.
To deal with these issues, it was necessary to write code that guarantees some level of repeatability and consistency when extracting and transforming data. This resulted in three R packages being written to undertake this task:
- {auspol} (Yáñez Santibáñez 2023b), which extracts and presents electoral results.
- {auscensus} (Yáñez Santibáñez 2023a), which interacts with Census Data Packs to extract different statistics across geographical units and across censuses.
- {aussiemaps} (Yáñez Santibáñez 2023c), which assists with aggregating census data into electoral divisions, by matching and apportioning different geographical structures.
The way each package operates is described on its respective website. Using them, it was possible to build a basic data extraction and transformation pipeline, which is represented in figure 2.1.
Figure 2.1: Flow of data from sources to dataset
In four steps, the extraction process consists of:
Census data was extracted from the respective Census Data Pack using {auscensus}. Using the package workflow, key attributes were identified in each census, extracted from the respective files and given common names. Data were extracted for statistical areas and apportioned into Commonwealth Electoral Divisions by overlapping area, with the help of functions written into {aussiemaps}.
Primary vote results for each division were extracted using the {auspol} package.
All the data was stored in a local database, from where it was extracted and put together in a single dataset.
From there, the “raw” data was further processed and stored in a single “consolidated” dataset. This dataset has been further refined throughout the data exploration and modelling steps.
2.2 Data Selection and Initial Transformation
Several considerations shaped the dataset that was eventually used, including which statistics to include, how to represent them, and how best to align census and election data.
2.2.1 How to present the numbers
The first point to consider was how to represent the data in a way that is consistent across electorates and time. Although the aim behind the creation and geographical distribution of Commonwealth Electoral Divisions is to provide equal representation in Parliament for every Australian, this is not completely possible in practice, resulting in electorates varying in population (between 72,345 and 138,836 voters). This is mainly due to the large variation in population density across Australia, combined with a constitutional mandate to guarantee a minimum number of seats per state or territory. For this reason, all voting and demographic statistics are represented as a percentage of each electorate's roll or population. This is also useful when comparing statistics across time.
The second point to address is the correspondence between census and election data. Since the census cycle (5 years) does not match the electoral cycle (determined by the incumbent government, with a 3-year term for the House of Representatives), there is a potential problem of the census data not being completely representative of the population on a given election day. Figure 2.2 presents the best matches between both events held in the 21st century.
Figure 2.2: Census and Elections Timeline
Considering the census data available and selecting the election closest to each census, four sets of events were selected for data extraction. These are presented in table 2.1.
| Census | Election |
|---|---|
| 2006 | 2007 |
| 2011 | 2010 |
| 2016 | 2016 |
| 2021 | 2022 |
Please note that this selection removes half of the election events within the period, which may affect model accuracy. However, since the objective is not to obtain an accurate prediction, this has been accepted as a reasonable trade-off to avoid having to interpolate demographic statistics.
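The pairing rule described above (take the election closest in time to each census) can be sketched as follows. This is only an illustrative Python sketch, not the code used in the project (whose pipeline is written in R); the function name `closest_election` and the full list of election years are assumptions added here.

```python
# Census years with Data Packs available, and the federal election years
# of the 21st century (the election list is an assumption added here).
CENSUS_YEARS = [2006, 2011, 2016, 2021]
ELECTION_YEARS = [2001, 2004, 2007, 2010, 2013, 2016, 2019, 2022]


def closest_election(census_year, election_years):
    """Return the election year with the smallest gap to the census year."""
    return min(election_years, key=lambda year: abs(year - census_year))


# Map each census to its nearest election.
pairs = {census: closest_election(census, ELECTION_YEARS) for census in CENSUS_YEARS}
```

Under these assumptions no census is equally distant from two elections, so the tie-breaking behaviour of `min` does not matter here.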
2.2.2 Electoral Data
In the case of the electoral data, not much processing was required. The source data already contains records of primary voting for each electorate. The only adjustment was to reclassify the vote into four groups (referred to as parties in this document):
- ALP for the Australian Labor Party.
- COAL representing the Coalition, made up of the Liberal Party, the National Party, the Liberal National Party of Queensland [1] and the Country Liberal Party in the Northern Territory [2].
- GRN for the Australian Greens.
- Other to collect votes from any other candidates, including minor parties and independents.
A data sample is presented in table 2.2.
| Year | Division | Abbreviation | Party | Votes | Percentage |
|---|---|---|---|---|---|
| 2022 | Canberra | ALP | Australian Labor Party | 34,574 | 45.20 |
| 2022 | Canberra | GRN | The Greens | 19,240 | 25.15 |
| 2022 | Canberra | COAL | Liberal (Coalition) | 16,264 | 21.26 |
| 2022 | Canberra | Other | Other Parties | 6,417 | 8.39 |
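As a rough illustration of the reclassification and percentage calculation, the Python sketch below groups votes into the four categories and expresses each as a share of the total. The mapping `PARTY_GROUPS` and the function `primary_vote_share` are hypothetical names introduced here, and the party name spellings are illustrative rather than taken from the AEC files; the vote counts are the Canberra figures from table 2.2.

```python
from collections import defaultdict

# Hypothetical mapping from party names to the four groups used in this
# document; the spellings are illustrative, not taken from the AEC files.
PARTY_GROUPS = {
    "Australian Labor Party": "ALP",
    "Liberal": "COAL",
    "The Nationals": "COAL",
    "Liberal National Party of Queensland": "COAL",
    "Country Liberal Party": "COAL",
    "The Greens": "GRN",
}


def primary_vote_share(candidate_votes):
    """Aggregate (party, votes) pairs into groups and return percentages."""
    totals = defaultdict(int)
    for party, votes in candidate_votes:
        # Anything not explicitly mapped falls into the "Other" group.
        totals[PARTY_GROUPS.get(party, "Other")] += votes
    all_votes = sum(totals.values())
    return {group: round(100 * votes / all_votes, 2) for group, votes in totals.items()}


# Group totals for Canberra in 2022, as reported in table 2.2.
canberra = [
    ("Australian Labor Party", 34574),
    ("The Greens", 19240),
    ("Liberal", 16264),
    ("Other Parties", 6417),
]
shares = primary_vote_share(canberra)
```

With these inputs the shares agree with the Percentage column of table 2.2.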
2.2.3 Census Data
When it comes to Census data, a number of considerations had to be tackled during extraction, namely:
Large volumes of data. Each census collected a large number of statistics. For instance, the data release for the 2021 Census contains 62 different tables, ranging from 8 [3] to 1,590 [4] attributes.
Data aggregated per electorate. Although the ABS provides statistics for non-ABS geographical structures, this only includes a subset of all data points collected. Thus, in many cases it is necessary to extract data for granular-level ABS units (SA1 in 2021) and aggregate them into electoral divisions. Without knowing the population density for each SA1, values have been approximately apportioned using areas.
Consistency across time. Due to the changing nature of a Census (to better serve its purpose), there are some minor variations in how data is collected and aggregated from Census to Census.
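The area-based apportionment mentioned above can be illustrated with a short Python sketch. It is not the {aussiemaps} implementation: the function `apportion_by_area`, its inputs, and the example figures are all hypothetical, and it assumes that the overlap areas between each SA1 and each division have already been computed.

```python
from collections import defaultdict


def apportion_by_area(sa1_counts, overlaps):
    """Split each SA1 count across divisions in proportion to overlapping area.

    sa1_counts: {sa1_code: count}
    overlaps:   [(sa1_code, division, overlap_area)], one row per intersection
    """
    # Total intersected area per SA1, used as the denominator of each share.
    sa1_area = defaultdict(float)
    for sa1, _division, area in overlaps:
        sa1_area[sa1] += area

    division_totals = defaultdict(float)
    for sa1, division, area in overlaps:
        division_totals[division] += sa1_counts[sa1] * area / sa1_area[sa1]
    return dict(division_totals)


# Hypothetical example: SA1 "A" lies wholly in division X, SA1 "B" is split 3:1.
counts = {"A": 400, "B": 200}
overlaps = [("A", "X", 2.0), ("B", "X", 1.5), ("B", "Y", 0.5)]
totals = apportion_by_area(counts, overlaps)
```

The approach implicitly assumes that people are spread evenly within each SA1, which is the approximation acknowledged in the text; the apportioned totals always add up to the original counts.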
To obtain a first selection of potentially relevant demographic variables to extract, existing literature and journalistic sources were consulted (Biddle and McAllister 2022; Parliament, n.d.; Jakubowicz and Ho, n.d.a). Since many variables are collinear by definition (e.g. income groups) or closely related (e.g. age and relationship status), the initial selection was inspected and refined. After several iterations, a set of 55 attributes was chosen, which can be classed into the following categories:
Income: Distribution of the population in pre-set income brackets. The highest income bracket includes everyone earning 2,000 dollars or more each week.
Education Level: Distribution of educational achievement (from incomplete secondary to vocational education and academic degrees).
Age: Year of birth is captured in the census, which was grouped into generational cohorts. The four groups of interest are Baby Boomers (1946 to 1964), Generation X (1965 to 1980), Generation Y (1981 to 1996) and Generation Z (1997 to 2021).
Relationship status: Variables describing civil status (e.g. living alone, married, in a de facto relationship).
Household type: Descriptors of the type of housing (e.g. standalone house, semi-detached, flats).
Household tenure: Descriptors of house ownership, rental or another arrangement (e.g. public housing).
Citizenship: Percentage of the population that holds Australian citizenship. Although non-citizens are not entitled to vote, this variable can be taken as a proxy for the relative integration of migrant communities into civic life.
Religion: Percentage of the population declaring a religious affiliation. For this analysis, large and high-growth religious groups were selected. For practical reasons and to use as a potential community proxy, the values of Anglican, Presbyterian and Uniting followers were merged into a single statistic.
Language: Languages spoken in the community. Similar to religion, a selection of relevant languages has been included to reflect historic and current migrant communities.
Additionally, each electorate was classified as metropolitan if it lies within the boundaries of an Australian capital city, or non-metropolitan otherwise. Altogether, these variables try to reflect wealth and education (cited by Biddle and McAllister (2022) as key factors influencing political persuasion), as well as the stage in life and belonging to a particular migrant community (sometimes cited as an influential factor, for instance in Jakubowicz and Ho (n.d.b)).
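The generational grouping listed under Age can be expressed as a simple lookup. The sketch below is illustrative Python only; the names `COHORTS`, `cohort` and `birth_year_from_age` are assumptions, as is the helper that derives an approximate birth year from age at census time.

```python
# Birth-year ranges as listed under "Age" above (both ends inclusive).
COHORTS = [
    ("Baby Boomers", 1946, 1964),
    ("Generation X", 1965, 1980),
    ("Generation Y", 1981, 1996),
    ("Generation Z", 1997, 2021),
]


def cohort(birth_year):
    """Return the cohort name for a birth year, or None if outside all ranges."""
    for name, first, last in COHORTS:
        if first <= birth_year <= last:
            return name
    return None


def birth_year_from_age(age, census_year):
    """Approximate birth year from age in whole years at census time."""
    return census_year - age
```

For example, `cohort(birth_year_from_age(30, 2016))` places a 30-year-old at the 2016 Census in Generation Y, while anyone born before 1946 falls outside the four groups of interest.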
A sample of the resulting dataset is presented in table 2.3.
| Election Year | Division | Australian Citizens | Baby Boomers | Gen Y | Rented | Chinese | Italian | Single Parent | 2000 or more |
|---|---|---|---|---|---|---|---|---|---|
| 2010 | Banks | 82.17 | 16.42 | 22.92 | 22.59 | 20.40 | 1.33 | 4.50 | 5.51 |
| 2016 | Barton | 72.40 | 15.39 | 29.13 | 33.65 | 17.39 | 1.94 | 4.16 | 6.88 |
| 2007 | Bennelong | 81.36 | 25.67 | 21.91 | 26.87 | 15.33 | 2.48 | 3.70 | 5.94 |
| 2022 | Brand | 82.72 | 16.50 | 23.58 | 22.65 | 0.64 | 0.23 | 5.32 | 12.43 |
| 2010 | Brisbane | 79.23 | 12.18 | 33.43 | 43.57 | 2.55 | 1.34 | 3.12 | 13.64 |
| 2007 | Chisholm | 78.60 | 23.10 | 23.71 | 24.29 | 12.69 | 2.61 | 3.96 | 4.16 |
| 2010 | Dobell | 90.24 | 18.57 | 18.72 | 23.68 | 0.41 | 0.44 | 5.87 | 3.58 |
| 2007 | Flinders | 86.81 | 26.55 | 19.22 | 17.12 | 0.14 | 1.10 | 4.29 | 1.96 |
| 2007 | Fowler | 87.37 | 25.65 | 26.02 | 24.54 | 10.30 | 2.60 | 6.64 | 0.55 |
| 2010 | Fowler | 85.32 | 14.23 | 21.64 | 20.04 | 9.04 | 2.51 | 6.50 | 1.50 |
| 2007 | Grey | 90.54 | 27.02 | 19.59 | 17.97 | 0.07 | 0.48 | 4.32 | 1.52 |
| 2022 | Herbert | 86.80 | 18.02 | 22.99 | 33.05 | 0.40 | 0.29 | 5.68 | 9.59 |
| 2016 | Hume | 90.02 | 19.60 | 19.36 | 16.83 | 0.45 | 0.53 | 4.29 | 7.16 |
| 2007 | Kennedy | 88.60 | 26.56 | 21.24 | 24.93 | 0.12 | 2.33 | 4.19 | 2.05 |
| 2010 | Lindsay | 87.74 | 15.43 | 23.88 | 22.92 | 0.77 | 0.72 | 5.68 | 3.47 |
| 2022 | Longman | 86.78 | 22.03 | 20.08 | 28.38 | 0.64 | 0.14 | 5.63 | 6.46 |
| 2016 | Lyne | 90.72 | 29.70 | 12.85 | 19.68 | 0.15 | 0.09 | 4.85 | 3.83 |
| 2016 | Maranoa | 88.03 | 23.18 | 17.49 | 28.28 | 0.22 | 0.25 | 4.32 | 4.38 |
| 2022 | McEwen | 88.79 | 17.70 | 22.29 | 15.45 | 0.68 | 1.01 | 4.04 | 13.11 |
| 2016 | Newcastle | 88.84 | 17.95 | 23.04 | 28.57 | 1.68 | 0.47 | 4.94 | 8.45 |
| 2022 | Nicholls | 86.66 | 24.49 | 18.20 | 20.20 | 0.79 | 0.96 | 4.79 | 6.09 |
| 2010 | Paterson | 91.28 | 24.06 | 16.26 | 22.56 | 0.19 | 0.19 | 4.98 | 3.93 |
| 2007 | Stirling | 83.23 | 24.55 | 20.78 | 25.78 | 2.01 | 4.60 | 4.87 | 3.68 |
| 2022 | Werriwa | 84.16 | 16.05 | 22.66 | 22.13 | 3.10 | 1.46 | 5.12 | 7.17 |
| 2007 | Wills | 83.85 | 21.61 | 20.05 | 26.15 | 2.23 | 11.52 | 4.55 | 2.18 |
2.3 Training, Validation and Testing Split
After obtaining the data, the election results and census statistics for the 2021/2022 cycle were set aside to be used as the testing dataset in an election forecast attempt. The remaining data has been used for exploratory analysis, data mining, and creating and fitting models.
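A hold-out of this kind amounts to filtering on the election year. The following Python sketch shows the idea on a handful of rows; the function `split_by_cycle` and the field names are hypothetical, and the records carry only the identifying columns of table 2.3.

```python
def split_by_cycle(records, test_year=2022):
    """Hold out one election cycle for testing; keep the rest for modelling."""
    train = [row for row in records if row["election_year"] != test_year]
    test = [row for row in records if row["election_year"] == test_year]
    return train, test


# A few rows from table 2.3, reduced to two identifying fields.
records = [
    {"election_year": 2007, "division": "Bennelong"},
    {"election_year": 2010, "division": "Banks"},
    {"election_year": 2016, "division": "Barton"},
    {"election_year": 2022, "division": "Brand"},
]
train, test = split_by_cycle(records)
```

Splitting by whole cycle, rather than sampling rows at random, keeps every electorate of the held-out election unseen during fitting.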
2.4 Data Exploration
In total, the resulting dataset is made up of 4 response variables and 55 potential predictors, plus identification attributes like division name and election year. As expected, many covariates exhibit moderate to high collinearity. It is also possible to observe some loose correlation between some of the covariates and some of the responses.
As an example, figure 2.3 shows a somewhat weak correlation between the Coalition primary vote and the percentage of Baby Boomers. Figure 2.4 presents the correlation values for religion and language variables, where it is possible to see:
- A positive correlation between monolingual English speakers and membership in Anglican, Presbyterian and Uniting churches. Together, they are likely proxies for the Anglo-Celtic population.
- Similarly, somewhat expected correlations that most likely indicate concentrations of linguistically and culturally diverse communities, e.g. Hinduism and South Asian languages, Catholicism and Italian, and Buddhism and East Asian languages.
Figure 2.3: Correlation between Coalition vote and Baby boomer population
Figure 2.4: Correlation for selected covariates
Additionally, after a detailed inspection, it is worth noting that:
- There is no apparent change in the relationship between a given covariate and the responses when broken down by state or capital city.
- There are no obviously distinguishable differences when splitting results by election.
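The correlations discussed in this section are assumed here to be Pearson coefficients (the text does not name the measure). As a minimal Python sketch of that calculation, the example below uses seven rows from table 2.3 (Banks, Barton, Bennelong, Brand, Dobell, Grey and Lyne); the function name `pearson` is an assumption, and a sample this small only illustrates the mechanics, not the results shown in the figures.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Seven rows from table 2.3: Australian citizens (%) and Chinese speakers (%).
citizens = [82.17, 72.40, 81.36, 82.72, 90.24, 90.54, 90.72]
chinese = [20.40, 17.39, 15.33, 0.64, 0.41, 0.07, 0.15]
r = pearson(citizens, chinese)
```

In this small sample the coefficient is negative, in line with the later observation that areas with fewer citizens tend to have larger multicultural populations.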
2.4.1 Dimensionality reduction using Multiple Factor Analysis
Given the large number of collinear covariates, it is worth exploring whether a change of space could capture the variation in a meaningful way with a more manageable number of dimensions. To achieve this, **multiple factor analysis** (MFA) (Escofier and Pagès 2008) was used as the dimensionality reduction method. MFA is essentially an extension of Principal Component Analysis that can deal with variables that belong to groups (as in this case). It can also combine quantitative and qualitative variables (such as belonging to a metropolitan area).
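As a brief reminder of how MFA balances the groups, the standard formulation for quantitative groups can be written as below. The notation is introduced here for illustration and is not taken from the project: $X_k$ is the centred and scaled table of variables in group $k$ (for instance income, education or religion), $K$ is the number of groups, and $\lambda_1^{(k)}$ is the first eigenvalue of a separate PCA on $X_k$.

```latex
Z = \left[\; \frac{X_1}{\sqrt{\lambda_1^{(1)}}} \;\Big|\; \frac{X_2}{\sqrt{\lambda_1^{(2)}}} \;\Big|\; \cdots \;\Big|\; \frac{X_K}{\sqrt{\lambda_1^{(K)}}} \;\right]
```

A global PCA is then run on $Z$. Dividing each group by the square root of its first eigenvalue means no single group (for example, one with many correlated variables) can dominate the first global dimension.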
The resulting scree plot and cumulative variance are presented in figure 2.5.
Figure 2.5: Scree plot and cumulative variance
Figure 2.6 presents group biplots for the 8 most important dimensions. Unfortunately, there is no straightforward interpretation of the dimensions, except for the association between Dimension 2 and the Education variables.
Figure 2.6: Group plots for first 8 dimensions
2.4.2 Electorate segments
Normally, when characterising voters, Australian politicians and political media make a distinction between inner-city voters (touted as wealthy and progressive), suburbia (“middle Australia”), and the bush and outback areas (conservative, “battlers”, “real Australia”). Therefore, it is of interest to explore whether this can be substantiated by demographic attributes, as it may have an impact on primary voting.
A clustering algorithm was applied to all demographic variables to identify those clusters. Different clustering approaches were tried, and the final approach was to:
- ignore census years and pool all records together.
- transform all demographic attributes to represent the difference between each data point and their corresponding national value (in the same year).
- use HDBSCAN (Campello, Moulavi, and Sander 2013), a density-based hierarchical clustering algorithm. Instead of requiring a pre-set target number of clusters, HDBSCAN derives the number of clusters from the data, given its tuning parameters.
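The second step above (expressing each attribute as a difference from the national value of the same year) can be sketched in Python as follows. The function `deviation_from_national` and the data layout are assumptions, and the national percentages are placeholders for illustration only, not ABS figures; the electorate values come from the Hume (2016) row of table 2.3.

```python
def deviation_from_national(record, national):
    """Subtract the same-year national value from each demographic attribute."""
    baseline = national[record["census_year"]]
    return {
        name: round(value - baseline[name], 2)
        for name, value in record["attributes"].items()
    }


# Hypothetical national percentages, for illustration only (not ABS figures).
national = {2016: {"rented": 30.0, "baby_boomers": 21.0}}

# Electorate values taken from the Hume (2016) row of table 2.3.
hume = {"census_year": 2016, "attributes": {"rented": 16.83, "baby_boomers": 19.60}}
centred = deviation_from_national(hume, national)
```

Centring on the national value of each year removes nationwide trends (such as cohorts ageing between censuses), so that pooled records from different years remain comparable before clustering.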
This results in 3 distinct clusters of electorates. Figures 2.7 and 2.8 present them on a map for the 2016 election.
Figure 2.7: Clusters for 2016 Election
Figure 2.8: Clusters for 2016 Election - Interactive
These three clusters are:
Cluster 0 seems to mostly contain electorates located in the inner cities, especially in Sydney and Melbourne. These areas tend to be more affluent, either “established” or “gentrified” suburbs. Notably, it also contains the three northernmost, remote electorates.
Cluster 1 comprises all regional areas outside state capitals (with the exception of Hobart in Tasmania).
Cluster 2 largely represents “suburbia”. It is also more prevalent in Brisbane and Perth than in the other capital cities.
Revisiting the demographic attributes can help to understand how these clusters differ from each other. A selection of those variables is presented in figure 2.9.
Figure 2.9: Selected attributes, coloured by cluster.
Even though electorates from every cluster can be found across the spectrum for every attribute, it is possible to observe that cluster 0 tends to concentrate areas with significant Millennial, highly educated, and relatively affluent populations. These areas also tend to attract newer migrants (lower numbers of citizens) and therefore have higher percentages of multicultural populations (such as Chinese speakers). Cluster 1 tends to concentrate older people, with lower percentages of tertiary and vocational education and possibly higher proportions of Anglo-Celtic Australians. Cluster 2 seems to sit between the other two clusters. Adding these findings to the geographical locations seems to confirm that there is some element of truth in the stereotypical classification of voters.
[1] In Queensland and the Northern Territory, the Liberal and National branches have merged. Elected federal MPs and senators sit with the Liberals if they come from an urban area, or with the Nationals when they represent a regional/rural/remote electorate.
[2] See note [1].
[3] 02 - Selected Medians and Averages
[4] 09 - Country of Birth of Person by Age by Sex